Word Discrimination Based on Bigram Co-Occurrences

نویسندگان

  • Adnan El-Nasan
  • Sriharsha Veeramachaneni
  • George Nagy
چکیده

Very few pairs of English words share exactly the same letter bigrams. This linguistic property can be exploited to bring lexical context into the classification stage of a word recognition system. The lexical n-gram matches between every word in a lexicon and a subset of reference words can be precomputed. If a match function can detect matching segments of at least n-gram length from the feature representation of words, then an unknown word can be recognized by determining the subset of reference words having an n-gram match at the feature level with the unknown word. We show that with a reasonable number of reference words, bigrams represent the best compromise between the recall ability of single letters and the precision of trigrams. Our simulations indicate that using a longer reference list can compensate errors in feature extraction. The algorithm is fast enough, even with a slow processor, for human-computer interaction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

یک مدل موضوعی احتمالاتی مبتنی بر روابط محلّی واژگان در پنجره‌های هم‌پوشان

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

متن کامل

Lexical Co-occurrence, Statistical Significance, and Word Association

Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weightage to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in span distributi...

متن کامل

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

On-Line Handwriting Recognition Based on Bigram Co-Occurrences

We propose a handwriting recognition method that utilizes the n-gram statistics of the English language. It is based on the linguistic property that very few pairs of English words share exactly the same letter bigrams. This property is exploited to bring context to the recognition stage and to avoid segmentation. The recognition is based on detecting bigram co-occurrences. Even with naive feat...

متن کامل

Image Steganalysis Based on Co-Occurrences of Integer Wavelet Coefficients

We present a steganalysis scheme for LSB matching steganography based on feature vectors extracted from integer wavelet transform (IWT). In integer wavelet decomposition of an image, the coefficients will be integer, so we can calculate co-occurrence matrix of them without rounding the coefficients. Before calculation of co-occurrence matrices, we clip some of the most significant bitplanes of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001